Abstract

We give a Large Deviation Principle (LDP) with explicit rate function 
for the distribution of vertex degrees in plane trees, a combinatorial model 
of RNA secondary structures. We calculate the typical degree distribu- 
tions based on nearest neighbor free energies, and compare our results with 
the branching configurations found in two sets of large RNA secondary 
structures. We find substantial agreement overall, with some interesting 
deviations which merit further study. 

1 Introduction 

In this paper we give a Large Deviation Principle (LDP) for a combinatorial 
model of RNA secondary structures. This mathematical result allows us to 
make quantitative statements about the expected or "typical" branching con- 
figurations for our model of RNA folding. We are motivated by the question 
of identifying "unusual" substructures in large RNA molecules, which is a cru- 
cial aspect of searching for putative functional motifs. This is a challenging 
biological question, particularly for lengthy RNA sequences whose size is prob- 
lematic for most existing computational approaches. We address one aspect of 
this problem by investigating the asymptotic branching degrees of large ran- 
dom trees under distributions which reflect the thermodynamics of RNA base 
pairing. 

Previous combinatorial results [10] on plane trees suggest that the degree 
of loop branching is correlated with thermodynamic stability and functional 
significance. We refine this analysis of the branching degree in RNA secondary 
structures by considering Gibbs distributions based on the nearest neighbor free 
energy parameters. We are particularly interested in the interplay between the 
energy term, which has dominated previous analyses, and the impact of entropy 
considerations in determining "unusual" configurations. Our mathematical
results are given as an LDP for the distribution of vertex degrees among plane
trees with $N$ vertices. To the best of our knowledge, no studies of Gibbs
distributions on random trees have been published, and our analysis of the
energy-entropy competition for this random tree model appears to be new. We
also compare our expected configurations as $N \to \infty$ with the branching
degrees found in
two sets of RNA secondary structures: large subunit 23S ribosomal structures 
derived by comparative sequence analysis from the Gutell Lab at UT Austin 
and picornaviral structures predicted by free energy minimization from the Pal- 
menberg Lab at UW Madison. We find substantial agreement overall between 
our asymptotic results for large random trees and the branching distributions 
found in the RNA secondary structures. This supports our statistical mechan- 
ics approach to developing a reasonable and mathematically tractable model of 
large RNA molecules. Conversely, deviations from our predictions indicate an 
aspect of RNA folding which is not well covered by the model and which merits 
further study. 

2 Overview 

A single-stranded RNA sequence encodes molecular structure and function in a
hierarchical way [21], from primary sequence through secondary structure to
the tertiary interactions that determine the three-dimensional structure. Since
the primary structure of an RNA molecule is a nucleotide sequence much like 
DNA, experimental sequencing techniques can easily determine its base compo- 
sition, and there are ever-increasing numbers of known RNA sequences. RNA 
molecules also resemble proteins though, since unlike the canonical DNA double 
helix, different RNA sequences fold into a variety of three-dimensional struc- 
tures. However, there are still only a few hundred solved RNA structures, 
largely small molecules or molecular fragments, in contrast to the thousands 
of known protein structures. Thus, understanding the relationship between an 
RNA sequence and the base pairings of its secondary structure is an essen- 
tial step in understanding the RNA structure-function hierarchy. Beyond the 
computational problem of RNA secondary structure determination, there is the 
question of evaluating the significance of the base pairings. In particular, iden- 
tifying "unusual" substructures in large RNA molecules is a crucial aspect of 
searching for putative functional motifs. 

We begin addressing this problem by investigating the typical branching 
configurations of large RNA molecules using a statistical mechanics approach 
with a combinatorial model of RNA folding. As detailed in [TUHH], trees are
widely used to represent nested RNA secondary structures, and as described in
Section 3 we model the folding of RNA sequences using plane trees - ordered,
rooted trees [20] which nicely abstract the different substructures in RNA
folding. In Section 4 we consider the set of all plane trees on $N$ vertices
and define a Gibbs distribution on that set using energy functions from the
nearest neighbor free energy model for RNA folding. We analyze these
distributions as $N \to \infty$, and give an LDP with explicit rate function.

1 There is a large body of literature on protein secondary structures (amino acid alpha
helices and beta sheets). However, this is unrelated to the nucleotide base-pairing pattern
that constitutes an RNA secondary structure.

Informally, an LDP with nonnegative rate function $I$ for random variables
$X_N$ taking values in a set $\mathcal{M}$ means that for all
$p \in \mathcal{M}$ and large $N$, we have

$$P\{X_N \approx p\} \approx e^{-N I(p)}.$$

In particular, when the minimal value is attained by $I$ at a unique point
$p^* \in \mathcal{M}$, then for any neighborhood $O$ of $p^*$, the probability
$P\{X_N \notin O\}$ decays exponentially in $N$. This can also be restated as a
Law of Large Numbers with exponential convergence in probability to the limit
point $p^*$.

As a consequence of this Law of Large Numbers, it makes sense to call 
a random tree from our model "typical" if the distribution of its branching 
degrees is close to p* . More precisely, the LDP for our model tells us that there 
is a distribution p* of branching degrees such that the distribution for a random 
tree is close to p* with probability approaching 1 as the size of the tree grows 
to infinity. Therefore, it also makes sense to consider any tree with a branching 
degree distribution considerably deviating from p* to be exotic. In Section [5] 
we compute p* , the asymptotically most probable branching sequences for our 
model. An immediate implication is that it is unlikely (in the framework of 
our model) to observe a large RNA secondary structure with branching degree 
distribution that significantly differs from p* . However, if such a conformation
is observed, the analysis of that conformation should result in some new insights.

Under the nearest neighbor thermodynamic model for RNA folding, the free 
energy of an RNA secondary structure is assumed to be the independent sum 
of the substructure free energies. In our model of RNA branching configura- 
tions, this corresponds to an assumption that the free energy of the entire tree 
is equal to the sum of free energies associated with each vertex. However, it is 
known from statistical mechanics that free energy is additive if all the subcon- 
figurations are statistically independent of each other. If this requirement is not 
satisfied then additional entropy corrections related to the interdependencies or 
interactions between the subsystems or subconfigurations should appear. 

We show that this is indeed the case for the systems that we consider. The 
combinatorial structure of the trees imposes certain restrictions on branching 
degrees that lead to their mutual statistical dependence which in turn induces 
certain entropy corrections. Due to this interplay between energy and entropy,
the typical trees minimize the free energy corrected by the extra entropy term
resulting from the combinatorics of plane trees; they do not minimize the
energy understood simply as the sum of the energies of all the vertices.
Therefore, the entropy correction is an important factor in determining the
branching of typical large trees, which have a broader distribution of loop
degrees than the exotic energy-minimizing configurations.

Based on our results, the fraction of vertices of degree $k$ in a typical large
tree decreases exponentially in $k$, but remains positive. As we discuss, the
exact rate of decay depends on the specific thermodynamic parameters, and there
are interesting differences in the behavior of our model under the two sets of
energy values considered. In Section 6 we compare these asymptotic degree
distributions with the branching found in a set of ribosomal and a set of
picornaviral RNA secondary structures. There are definite qualitative
similarities between our predictions and the secondary structure data, as well
as various differences which suggest areas for future investigations.

Figure 1: The secondary structure, generated by the mfold Web Server available
through http://frontend.bioinfo.rpi.edu/zukerm/home.html, for a 79 base
fragment from the 3' UTR of the 7440 nucleotide RNA virus poliovirus 1-Mahoney,
Genbank Accession No. J0228140 [18]. The structure has two hairpin loops, two
internal loops (one of which is a bulge loop of size 2), one branching (multi)
loop, and an external loop. The adjacent plane tree (rooted at the bottom)
models the configuration of the RNA secondary structure, preserving information
about the basic arrangement of loops/vertices and helices/edges.

3 Modeling RNA folding by trees 

As pictured in Figure 1, RNA secondary structures can be modeled as trees
by collapsing each single-stranded loop into a point and replacing the stacked
base pairs by an edge connecting two such points. The tree is rooted at the
vertex corresponding to the external loop, which contains the 5' and 3' ends of
the sequence, and by imposing a linear ordering on the vertices of the tree, we
maintain the 5' to 3' orientation of the RNA molecule. Such an ordered, rooted
tree, known as a plane tree [20], gives a "low-resolution" model of RNA folding;
it preserves information about the basic arrangement of loops and helices in
an RNA secondary structure, and also captures certain essential elements of
the free energy thermodynamic model. The free energy of a particular RNA
secondary structure is calculated as the independent sum over the energies of
well-defined substructures, namely the helices and different classes of loop
structures. The primary loop classification is according to the number of base
pairs, that is according to the branching degree (the number of children) of
the corresponding vertex. Since we consider only the branching degree of the
vertices in our rooted trees, we will frequently refer to the number of children
simply as the degree of the vertex. Hence, there are three basic types of loops
which we consider with the associated free energies given in Table 1. For our
purposes, we consider bulge loops to be a special type of internal/degree 1
loop, loops with degree $\geq 2$ are called "branching" loops rather than
"multiloops", and the exceptional energy function for the external loop is
disregarded.

Name        Branching degree   dG 2.3               dG 3.0
Hairpin     0                  3.5                  4.10
Internal    1                  3.0                  2.3
Branching   $d \geq 2$         4.6 - 0.2 (d + 1)    3.4 - 1.5 (d + 1)

Table 1: Loop structures and associated free energies (in kcal/mole) at 37° [14, 24].
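The energies in Table 1 can be written as a single function of the branching degree. The following is a minimal sketch, not part of the original analysis; the names `TABLE_1` and `loop_energy` and the dictionary keys are chosen here for illustration only.

```python
# Vertex energies from Table 1 (kcal/mole) as a function of branching degree d.
# The names TABLE_1, loop_energy and the keys below are illustrative assumptions.
TABLE_1 = {
    "dG2.3": {"hairpin": 3.5, "internal": 3.0, "offset": 4.6, "slope": 0.2},
    "dG3.0": {"hairpin": 4.1, "internal": 2.3, "offset": 3.4, "slope": 1.5},
}


def loop_energy(d, model="dG3.0"):
    """Return the energy of a loop whose vertex has d children."""
    if d < 0:
        raise ValueError("branching degree must be nonnegative")
    params = TABLE_1[model]
    if d == 0:  # hairpin loop
        return params["hairpin"]
    if d == 1:  # internal loop (bulges included)
        return params["internal"]
    # branching loop: offset - slope * (d + 1), as in the last row of Table 1
    return params["offset"] - params["slope"] * (d + 1)


for d in range(5):
    print(d, loop_energy(d, "dG2.3"), loop_energy(d, "dG3.0"))
```

Under these values a degree 2 branching loop has negative energy in the dG 3.0 column and positive energy in the dG 2.3 column, which is one way to see why the two parameter sets lead to different typical trees.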

Here, we consider two possible energy values for each type of loop structure,
corresponding to the current standard known as dG 3.0 and the former standard
dG 2.3. (See "Version 3.0 free energy parameters for RNA folding at 37°" and
"Version 2.3 free energy parameters for RNA folding at 37°" available through
the mfold website.) The energy of a loop is a function of the number of
single-stranded bases and the number of base pairs, with an additional
dependency for the stacking interactions. For the purposes of our model, we
have chosen a specific energy value from the unbounded set of possibilities for
each of the
three types of loops. These values correspond to loops where any enclosed 
base pairs are G - C, the closing base pair is C - G, and the single-stranded 
segments are A i . These loops occur in the combinatorial model of RNA folding 
previously considered in [10], and the dG 3.0 thermodynamic values were used
in the results on RNA branching degrees given there. Clearly, there are many
other possible choices and it may be interesting to investigate the impact of
different thermodynamic values (energy minimizing versus maximizing, average
versus frequent, etc.) on the behavior of the model. We note that the dG
2.3 parameters were originally included in our analysis because the picornaviral
secondary structures from [18] which are analyzed in Section 6 were determined
using those values. In doing so, though, we noticed interesting changes in the
evolution of the free energy model.

The free energy model is evolving in two significant ways. One type of devel- 
opment is extending and refining the experimental determination of thermody- 
namic values for the entropy and enthalpy of specific base interactions [131 Q3] . 
While this has improved the accuracy of RNA secondary structure prediction, 
it has also greatly increased the complexity of the thermodynamic calculations; 
the free energy model now includes more than 10,000 parameters, nearly all of 
which pertain to small internal loops. The other evolving component is changes 
in the estimation of free energy functions which have not, or worse cannot, be 
measured directly. The loop destabilizing energies are the most notable instance
of this, and the major source of change between the previous energy parameters
(dG 2.3) and those currently used (dG 3.0). Through our mathematical
results given in the next two sections, though, we can assess the impact of these 
changes and the importance of the entropy correction on the likely configura- 
tions of large RNA secondary structures without getting lost in the thousands 
of detailed thermodynamic parameters. 

4 The Large Deviation Principle 

As described above, we consider plane trees as our combinatorial model of RNA 
folding. Now, we introduce a family of Gibbs distributions on the trees, and 
state our main mathematical results. 

We fix a number $D \in \mathbb{N}$ and for each $N \in \mathbb{N}$ consider the
set $\mathcal{T}_N(D)$ of plane trees on $N$ vertices such that the number of
children of each vertex (the branching degree) does not exceed $D$. We restrict
ourselves to the trees with bounded degrees to simplify the mathematical
treatment. However, if $D$ is suitably large, this does not impose any
significant restrictions since, although the degree of branching in RNA loops
is theoretically unbounded, in practice it is necessarily limited by physical
constraints. Moreover, as we shall see in the next section, the properties of
the model stabilize as $D \to \infty$.

To define Gibbs distributions on $\mathcal{T}_N(D)$ we associate an energy with
each plane tree. In our model of RNA branching configurations, we assume that
the energy associated with each vertex depends only on its branching degree and
is given by a function $c : \{0, 1, \ldots, D\} \to \mathbb{R}$. To a first
approximation, this is consistent with the thermodynamics of RNA folding. The
energy of a tree $T \in \mathcal{T}_N(D)$ is then given by

$$H(T) = \sum_{j=1}^{N} c(d_j(T)) = \sum_{k=0}^{D} c(k)\,\chi_k(T), \tag{1}$$

where $d_j$ denotes the branching degree of vertex $j$, and $\chi_k(T)$ is the
number of vertices with $k$ children in $T$. Now the Gibbs probability measure
on $\mathcal{T}_N(D)$ associated with $H$ is given by

$$P_N\{T\} = \frac{e^{-\beta H(T)}}{Z_N}, \quad T \in \mathcal{T}_N(D),$$

where $\beta > 0$ is the inverse temperature parameter and $Z_N$ is a
normalizing constant known as the partition function:

$$Z_N = \sum_{T \in \mathcal{T}_N(D)} e^{-\beta H(T)}.$$
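For very small $N$ the Gibbs measure can be computed by brute force, which may help make the definitions concrete. The sketch below is an illustration under stated assumptions: a plane tree is encoded by the list of its branching degrees in preorder (a standard encoding, not taken from the text), the function names are hypothetical, and the energy function $c(d) = d^2$ is a toy choice rather than one of the thermodynamic models.

```python
import math
from itertools import product


def is_plane_tree(degrees):
    """Return True if a preorder list of child counts encodes a plane tree."""
    pending = 1  # vertices still to be read; the root is owed at the start
    for d in degrees:
        if pending == 0:  # the tree closed before the sequence ended
            return False
        pending += d - 1  # one vertex read, d children now owed
    return pending == 0


def plane_trees(n_vertices, max_degree):
    """Yield every plane tree on n_vertices vertices with degrees <= max_degree."""
    for degrees in product(range(max_degree + 1), repeat=n_vertices):
        if is_plane_tree(degrees):
            yield degrees


def tree_energy(degrees, c):
    """H(T): the sum of the vertex energies c(d_j) over all vertices."""
    return sum(c(d) for d in degrees)


def gibbs_measure(n_vertices, max_degree, c, beta):
    """Return (probabilities, Z_N) computed by exhaustive enumeration."""
    trees = list(plane_trees(n_vertices, max_degree))
    weights = [math.exp(-beta * tree_energy(t, c)) for t in trees]
    z = sum(weights)
    return {t: w / z for t, w in zip(trees, weights)}, z


# Toy energy c(d) = d^2, chosen only for illustration.
probabilities, z = gibbs_measure(4, 3, c=lambda d: float(d * d), beta=1.0)
for tree, prob in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(tree, round(prob, 4))
```

The enumeration visits $(D+1)^N$ candidate sequences, so it is only usable for a handful of vertices; the asymptotic analysis in the text is what replaces it for large $N$.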

There are several interesting questions one could ask about the asymptotic
behavior of the measures $P_N$ as $N \to \infty$. Here we would like to study
the frequencies of branching degrees, so for each $N$ we introduce a probability
measure $\nu_N$ on $[0,1]^{D+1}$ defined as the distribution of the random
vector $\frac{1}{N}(\chi_0(T), \chi_1(T), \ldots, \chi_D(T))$ under $P_N$. Our
main result is an LDP for $\nu_N$.

Let us recall that a sequence of probability measures
$(\mu_N)_{N \in \mathbb{N}}$ on a compact metric space $(E, \rho)$ satisfies an
LDP with a nonnegative lower-semicontinuous rate function
$I : E \to \mathbb{R}$ if

$$\limsup_{N \to \infty} \frac{1}{N} \ln \mu_N(C) \leq -I(C) \quad \text{for any closed set } C \subset E,$$

and

$$\liminf_{N \to \infty} \frac{1}{N} \ln \mu_N(O) \geq -I(O) \quad \text{for any open set } O \subset E,$$

where for a set $O$, we denote $I(O) = \inf_{p \in O} I(p)$, see
[7, Section II.3] or [5, Section 1.2].

Informally, an LDP means that if we consider random variables $X_N$ with
distribution $\mu_N$, then for all $p$ and large $N$ we have

$$P\{X_N \approx p\} \approx e^{-N I(p)}.$$

In particular, if the minimal value is attained by $I$ at a unique point $p^*$,
then for any neighborhood $O$ of $p^*$, $\mu_N(O^c) = P\{X_N \notin O\}$ decays
exponentially in $N$. This can be restated as a Law of Large Numbers with
exponential convergence in probability to the limit point $p^*$.

For our model, it is natural to formulate the LDP for $\nu_N$ on the set

$$\mathcal{M} = \left\{ p \in [0,1]^{D+1} : \sum_{k=0}^{D} p_k = 1,\ \sum_{k=0}^{D} k\,p_k = 1 \right\}$$

equipped with Euclidean distance. Though the random vector
$\frac{1}{N}(\chi_0, \ldots, \chi_D)$ does not belong to $\mathcal{M}$, it is
asymptotically close to $\mathcal{M}$:

$$\sum_{k=0}^{D} \frac{\chi_k}{N} = 1, \qquad \sum_{k=0}^{D} k\,\frac{\chi_k}{N} = 1 - \frac{1}{N}.$$

So instead of formulating an LDP for the sequence of random vectors
$\frac{1}{N}(\chi_0, \ldots, \chi_D)$ we shall formulate an LDP for a sequence
of random vectors that is close to it and belongs to $\mathcal{M}$.
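The two sums above follow from counting vertices and edges: a tree on $N$ vertices has $N - 1$ edges, and every edge contributes one child. As a quick numerical check, here is a small sketch; the preorder degree sequence `example` is a hypothetical tree on 7 vertices chosen for illustration, and `degree_frequencies` is a name introduced here.

```python
from collections import Counter


def degree_frequencies(degrees, max_degree):
    """Return the vector (chi_0 / N, ..., chi_D / N) for a list of vertex degrees."""
    n_vertices = len(degrees)
    counts = Counter(degrees)
    return [counts.get(k, 0) / n_vertices for k in range(max_degree + 1)]


# Hypothetical plane tree on N = 7 vertices, given by child counts in preorder.
example = (2, 1, 0, 3, 0, 0, 0)
x = degree_frequencies(example, max_degree=3)

total_mass = sum(x)                                # first constraint, equals 1
mean_degree = sum(k * xk for k, xk in enumerate(x))  # equals 1 - 1/N, not 1

print(x)
print(total_mass, mean_degree, 1 - 1 / len(example))
```

The mean degree misses the value 1 required by the second constraint by exactly $1/N$, which is the gap that the auxiliary random vectors in the theorem are introduced to close.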

Let us introduce $J : \mathcal{M} \to \mathbb{R}$ via

$$J(p) = \beta E(p) - h(p),$$

where

$$h(p) = -\sum_{k=0}^{D} p_k \ln p_k \tag{2}$$

is the entropy of the probability vector $p = (p_0, \ldots, p_D)$, and

$$E(p) = \sum_{k=0}^{D} p_k\,c(k)$$

is the energy associated with $p \in \mathcal{M}$.

The function $J$ is strictly convex, and attains its minimum on $\mathcal{M}$
at a unique point $p^*$. Let

$$I(p) = J(p) - J(p^*). \tag{3}$$
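To see the rate function at work in the simplest nontrivial case, take $D = 2$. The two constraints defining $\mathcal{M}$ then force $p_0 = p_2$, so $\mathcal{M}$ is the segment $p = (t, 1 - 2t, t)$ with $0 \leq t \leq 1/2$, and the minimizer can be found by a one-dimensional search. The sketch below is illustrative only: the function names are hypothetical, the grid search is a crude substitute for exact minimization, and the zero energy function is a toy choice under which the minimizer of $J$ is the maximum entropy point.

```python
import math


def entropy(p):
    """h(p) = -sum p_k ln p_k, with the convention 0 ln 0 = 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


def mean_energy(p, c):
    """E(p) = sum p_k c(k)."""
    return sum(pk * c(k) for k, pk in enumerate(p))


def J(p, c, beta):
    """J(p) = beta E(p) - h(p)."""
    return beta * mean_energy(p, c) - entropy(p)


def rate_function_on_grid(c, beta, steps=3000):
    """For D = 2, evaluate I = J - min J on the segment p = (t, 1 - 2t, t)."""
    grid = [i / steps for i in range(1, steps // 2)]
    values = {t: J((t, 1 - 2 * t, t), c, beta) for t in grid}
    t_star = min(values, key=values.get)
    j_min = values[t_star]
    return t_star, {t: v - j_min for t, v in values.items()}


# Toy case: zero energy, so the typical tree maximizes entropy on M.
t_star, rate = rate_function_on_grid(c=lambda k: 0.0, beta=1.0)
print(t_star, (t_star, 1 - 2 * t_star, t_star))
```

With zero energy the search returns $t^* \approx 1/3$, that is, equal proportions of vertices of degree 0, 1 and 2; a nonzero energy function shifts this point along the segment.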

For a measure $Q$ on $[0,1]^{D+1} \times \mathcal{M}$ we define $Q^{(1)}$ and
$Q^{(2)}$ as the marginal distributions of $Q$ on $[0,1]^{D+1}$ and
$\mathcal{M}$ respectively.

Theorem 1 There is a sequence of probability measures
$(Q_N)_{N \in \mathbb{N}}$ defined on $[0,1]^{D+1} \times \mathcal{M}$ with the
following properties.

1. For each $N$, we have $Q_N^{(1)} = \nu_N$.

2. For each N , 

Q N l(x,y) G [0,1} D+1 x M : -Vk\ > ^1 = 0- 

3. The sequence $(Q_N^{(2)})_{N \in \mathbb{N}}$ of marginals of $Q_N$ on
$\mathcal{M}$ satisfies an LDP on $\mathcal{M}$ with the rate function $I$
defined in (3).

Remark 1 This theorem says that though the random vector $\chi/N$ does not
belong to $\mathcal{M}$, one can find another random vector that is, on the one
hand, very close to $\chi/N$ and on the other hand belongs to $\mathcal{M}$ and
satisfies the LDP.

An immediate consequence is the following Law of Large Numbers:

Corollary 1 As $N \to \infty$,

$$\left( \frac{\chi_0}{N}, \frac{\chi_1}{N}, \ldots, \frac{\chi_D}{N} \right) \to p^*$$

in probability.



Remark 2 The statements above show that with high probability the degree 
frequencies are close to p*. Note that in most cases p* is not the minimizer of 
the energy E on M. 

We shall now give a sketch of the proof of Theorem 1. The proof is based
on the fact that trees with equal branching degree sequences have equal energy.
Therefore,

$$P_N\{\chi(T) = n\} = \frac{C(N,n)\, e^{-\beta \sum_{k=0}^{D} c(k)\, n_k}}{Z_N}, \tag{4}$$

where $n = (n_0, \ldots, n_D)$ and $C(N,n)$ is the number of plane trees of
order $N$ with $n_k$ nodes of branching degree $k$:

$$C(N,n) = \frac{1}{N} \binom{N}{n_0, n_1, n_2, \ldots} = \frac{1}{N} \frac{N!}{n_0!\, n_1!\, n_2! \cdots}$$

if $n_1 + 2n_2 + \cdots = N - 1$, and $0$ otherwise (see e.g. Theorem 5.3.10 in
[20]). One can apply the formula

$$\frac{N!}{n_0! \cdots n_D!} = \exp\left\{ N \left( -\sum_{k=0}^{D} \frac{n_k}{N} \ln \frac{n_k}{N} + O\left( \frac{\ln N}{N} \right) \right) \right\} = \exp\left\{ N h\left( \frac{n}{N} \right) + O(\ln N) \right\}, \quad N \to \infty,$$

which holds true uniformly in $n$, see e.g. Lemma 1.4.4. Plugging this into
(4), and using $\sum_{k} c(k)\, n_k = N E(n/N)$, we get

$$P_N\left\{ \frac{\chi(T)}{N} = \frac{n}{N} \right\} = \frac{e^{-N[\beta E(n/N) - h(n/N)] + O(\ln N)}}{Z_N} = \frac{e^{-N J(n/N) + O(\ln N)}}{Z_N},$$

which is the desired asymptotics. In fact, the LDP that we claim is a stronger
statement and requires extra work to complete this argument rigorously. The
complete proof along with other random tree models will appear in detail
elsewhere pp.
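The counting formula and the entropy approximation used in this sketch can be checked numerically. The following is an illustration, not part of the proof: the function names are introduced here, and the degree vector `n` is an arbitrary admissible example on $N = 999$ vertices.

```python
import math


def count_plane_trees(n):
    """C(N, n) for n = (n_0, ..., n_D), where N = sum(n)."""
    big_n = sum(n)
    # Admissibility: the total number of children must equal N - 1.
    if sum(k * nk for k, nk in enumerate(n)) != big_n - 1:
        return 0
    multinomial = math.factorial(big_n)
    for nk in n:
        multinomial //= math.factorial(nk)
    return multinomial // big_n  # exact: the multinomial is divisible by N here


def entropy(p):
    """h(p) = -sum p_k ln p_k, with the convention 0 ln 0 = 0."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


# Small case: the plane trees on 4 vertices, grouped by degree counts.
small_cases = [(1, 3, 0, 0), (2, 1, 1, 0), (3, 0, 0, 1)]
print([count_plane_trees(n) for n in small_cases])

# Larger case: compare (1/N) ln C(N, n) with h(n/N).
n = (301, 398, 300)  # n_1 + 2 n_2 = 998 = N - 1
big_n = sum(n)
log_count_per_vertex = math.log(count_plane_trees(n)) / big_n
h = entropy([nk / big_n for nk in n])
print(log_count_per_vertex, h)
```

The three small counts add up to 5, the number of plane trees on 4 vertices, and in the larger case the two printed values differ by a term of order $\ln N / N$, in line with the $O(\ln N)$ correction in the exponent.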



5 Applications to RNA secondary structure 

In this section we compute the asymptotically most probable branching
sequences for our model under an additional requirement that the coefficients
$c(m)$ are given by

$$c(m) = \begin{cases} A_1, & m = 0, \\ A_2, & m = 1, \\ A_3 - A_4 m, & m \geq 2, \end{cases}$$

for some numbers $A_1, A_2, A_3, A_4$. Both the dG 2.3 and dG 3.0 thermodynamic
values in Table 1 satisfy this requirement, and we shall address these models in
detail at the end of this section.
For this choice of $c(m)$ we have

$$\beta E(p) = a_1 p_0 + a_2 p_1 + \sum_{m=2}^{D} (a_3 - a_4 m)\, p_m,$$

where

$$a_i = \beta A_i = \frac{A_i}{kT}, \quad i = 1, 2, 3, 4,$$

$k = 1.99$ cal/(mole K) being the Boltzmann constant, and $T$ the temperature.

Corollary 1 implies that a typical conformation will have degree frequencies
close to the solution of

$$S(p) \to \min, \quad p \in \mathcal{M},$$

where

$$S(p) = \sum_{m=0}^{D} p_m \ln p_m + a_1 p_0 + a_2 p_1 + \sum_{m=2}^{D} (a_3 - a_4 m)\, p_m.$$

It is easy to see that, since the function $x \mapsto x \ln x$ has infinite
negative derivative at zero, the minimal value of $S(p)$ cannot be attained at
the boundary of $\mathcal{M}$. Moreover, $S$ is strictly convex, so that there
is a unique minimizer. Therefore we can solve this problem by the method of
Lagrange multipliers. We set



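$$S(p, \lambda) = S(p) + \lambda_0 \left( \sum_{m=0}^{D} p_m - 1 \right) + \lambda_1 \left( \sum_{m=0}^{D} m\, p_m - 1 \right).$$

The optimal vector $(p^*, \lambda)$ must satisfy

$$0 = \frac{\partial}{\partial p_m} S(p^*, \lambda) = \begin{cases} a_1 + \ln p_0^* + 1 + \lambda_0, & m = 0, \\ a_2 + \ln p_1^* + 1 + \lambda_0 + \lambda_1, & m = 1, \\ a_3 - a_4 m + \ln p_m^* + 1 + \lambda_0 + m \lambda_1, & m \geq 2. \end{cases}$$

We rewrite this as

$$p_0^* = b_1^{-1} \mu, \qquad p_1^* = b_2^{-1} \mu \nu, \qquad p_m^* = b_3^{-1} \mu\, (b_4 \nu)^m, \quad m \geq 2, \tag{5}$$

where $\mu = e^{-\lambda_0 - 1}$, $\nu = e^{-\lambda_1}$ and $b_i = e^{a_i}$,
$i = 1, \ldots, 4$. We notice that

$$1 = \sum_{m=0}^{D} p_m^* = \mu \left( b_1^{-1} + b_2^{-1} \nu + b_3^{-1} \sum_{m=2}^{D} (b_4 \nu)^m \right),$$

$$1 = \sum_{m=0}^{D} m\, p_m^* = \mu \left( b_2^{-1} \nu + b_3^{-1} \sum_{m=2}^{D} m\, (b_4 \nu)^m \right).$$

Instead of solving this system explicitly, let us consider the case of
$D \gg 1$, i.e., rewrite the limiting system for $D \to \infty$:

$$1 = \mu \left( b_1^{-1} + b_2^{-1} \nu + b_3^{-1} \frac{b_4^2 \nu^2}{1 - b_4 \nu} \right),$$

$$1 = \mu \left( b_2^{-1} \nu + b_3^{-1} \frac{2 b_4^2 \nu^2 - b_4^3 \nu^3}{(1 - b_4 \nu)^2} \right).$$

Excluding $\mu$ we get a quadratic equation on $\nu$ and among the two roots we
choose

$$\nu = \frac{\sqrt{b_3}}{b_4 \left( \sqrt{b_1} + \sqrt{b_3} \right)}$$

that satisfies $0 < b_4 \nu < 1$.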
Figure 2: The first 11 values of $p_m$ for both the dG 3.0 model and the dG 2.3
model, where the right-hand graph shows the logarithm of the values.



Now we can express $\mu$ from the first equation of the limiting system as

$$\mu = \frac{b_1 b_2 b_4 \left( \sqrt{b_1} + \sqrt{b_3} \right)}{b_2 b_4 \left( 2\sqrt{b_1} + \sqrt{b_3} \right) + b_1 \sqrt{b_3}}.$$



For the dG 3.0 model, we have $A_1 = 4.1$ kcal/mole, $A_2 = 2.3$ kcal/mole,
$A_3 = 1.9$ kcal/mole, $A_4 = 1.5$ kcal/mole at $T = 273 + 37 = 310$ K. Then the
solution given above yields

$$\nu \approx 0.013, \qquad \mu \approx 368.3.$$

Likewise, for the dG 2.3 model, we have $A_1 = 3.5$ kcal/mole, $A_2 = 3.0$
kcal/mole, $A_3 = 4.4$ kcal/mole, $A_4 = 0.2$ kcal/mole at
$T = 273 + 37 = 310$ K. Then the solution given above yields

$$\nu \approx 0.46, \qquad \mu \approx 121.3.$$

The first several values of $p_m$ in both cases are displayed in Figure 2.
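The limiting solution can be evaluated numerically from the closed form for $\nu$ and the normalization condition. The sketch below is an illustration applied to the dG 3.0 values quoted above; the function and constant names are chosen here, and $\mu$ is obtained from the first equation of the limiting system rather than from a closed-form expression.

```python
import math

GAS_CONSTANT = 1.99e-3   # kcal/(mole K), the value of k quoted in the text
TEMPERATURE = 310.0      # K, i.e. 37 degrees Celsius


def typical_distribution(A1, A2, A3, A4, n_terms=11):
    """Return (nu, mu, [p_0, ..., p_{n_terms-1}]) for the D -> infinity system."""
    kT = GAS_CONSTANT * TEMPERATURE
    b1, b2, b3, b4 = (math.exp(A / kT) for A in (A1, A2, A3, A4))
    # x = b4 * nu is the root of the quadratic lying in (0, 1).
    x = math.sqrt(b3) / (math.sqrt(b1) + math.sqrt(b3))
    nu = x / b4
    # Normalization: 1 = mu * (1/b1 + nu/b2 + x^2 / (b3 * (1 - x))).
    mu = 1.0 / (1.0 / b1 + nu / b2 + x * x / (b3 * (1.0 - x)))
    p = [mu / b1, mu * nu / b2]
    p += [mu * x ** m / b3 for m in range(2, n_terms)]
    return nu, mu, p


# dG 3.0 parameters in kcal/mole, as listed in the text.
nu, mu, p = typical_distribution(4.1, 2.3, 1.9, 1.5)
print(round(nu, 4), round(mu, 1))
print([round(pm, 4) for pm in p[:5]])
```

For these parameters the first three frequencies come out close to 0.478, 0.112 and 0.351, and the later terms form a geometric sequence with ratio $b_4 \nu$.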

The LDP for our model implies that, typically, the frequency of the loops of
degree $k$ decreases exponentially in $k$. However, the relative frequency for
the first three terms and the exact rate of decay depend on the specific
thermodynamic parameters. We know from previous results [10] that the trees
which minimize the associated free energies in the dG 3.0 model maximize the
number of vertices of degree 2. We see a similar behavior in the asymptotic
distribution of vertex degrees under our LDP with the dG 3.0 thermodynamic
values; in a typical large tree, 47.8% of the vertices would have degree 0 and
35.1% would have degree 2. Because of the impact of the entropy term
correction, though, 11.2% of the vertices would have degree 1, and a
vanishingly small but still nonzero percentage would be likely to have some
degree $\geq 3$. Thus, under the dG 3.0 model, the frequency of branching
degrees in a typical large tree is a refined, and certainly more reasonable,
distribution which still resembles our original calculation of the
energy-minimizing configurations.

In contrast, the relative frequency among the vertices with degree 0, degree
1, and degree $\geq 2$ is significantly different for the distribution
calculated with the dG 2.3 values. Now, in a typical large tree, while 41.7% of
the vertices would still have degree 0, only 5.5% would have degree 2, and
43.2% would have degree 1. Furthermore, although the percentage of loops with
degree $\geq 3$ still decreases exponentially, the rate is significantly lower
than it was with the dG 3.0 values. The differences in the thermodynamic values
are primarily a result of changes in the loop destabilizing energies for the
hairpin and internal loops as well as more significant changes in the offset,
free base penalty, and helix penalty for the multibranched loop energy
function. In particular, the dG 2.3 values for the offset, free base penalty,
and helix penalty are 4.60, 0.40, and 0.10 respectively, while the dG 3.0
values are 3.40, 0.00, and 0.40. Intuitively, branching is significantly more
favorable, energetically speaking, under the dG 3.0 thermodynamic model than it
was in the dG 2.3 version. These changes then have a significant impact on the
distribution among loops of small degrees as well as on the decay rate for the
tail of the distribution.

In our model, we are able to assess the impact of these changes on the
distribution of branching degrees for a typical large tree. However, our
low-resolution model of RNA folding does not permit any assessment of the
correctness of the two thermodynamic models, as was done in a recent analysis
[6]. As we shall see, though, it is the dG 2.3 distribution, and not the dG 3.0
model, which more closely resembles the frequency of branching degrees in both
the large subunit 23S ribosomal and the picornaviral RNA secondary structures.

6 Ribosomal and picornaviral branching degrees 

We analyze the branching degrees found in two different sets of RNA secondary 
structures, and compare them with the typical branching sequences for our 
large random trees. Our findings are summarized here in Figure 3 and in the
discussion, while more details are given in Appendix A. Overall, the branching
of these secondary structures agrees with the results for our model, although
there are deviations which suggest interesting avenues for further investigation.
Our comparisons are qualitative, rather than quantitative, since it would be 
unrealistic to expect precise agreement between our "low-resolution" model of 
RNA folding and the branching configurations of large ribosomal and picornavi- 
ral secondary structures. Still, we find some striking similarities between the 
predictions based on our model and the data for real RNA sequences. 

Figure 3: The distribution of loop degrees as fractions of the total. Each graph
shows both the averages over the data set, as given in Tables 4 and 10, and the
filtered averages, as given in Tables 7 and 14, after the smallest internal
loops have been removed.

The first set of results, found in Appendix A.1, is for the large subunit 23S
ribosomal RNA secondary structures determined through comparative sequence
analysis by the Gutell Lab. We give results for 20 of the 77 pseudoknot-free
sequences available online through their Comparative RNA Web (CRW) Site and
Project [2]. The chosen sequences were also used in the analyses of [5] and
are representative of the whole set. As seen in Table 2, the average sequence
length is 2756.2 nucleotides, although there is certainly variability among the 
different types of ribosomal sequences. Since our results are asymptotic, we 
disregard the particular energy function for the external loop, and the degrees 
of the external loops are listed separately in Table [H 

In Tables 3 and 4 we give the distribution of loop degrees, where the degree
of a loop is one less than the number of base pairs contained in the loop. We
see that the most prevalent loops (46.81% overall) are the internal loops with
degree 1, followed by the hairpin structures with degree 0. Most of the branching
loops have degree 2, which agrees with the previous combinatorial analysis [10],
although there is a distribution extending out to branching loops of degree 12.
We note that the distribution of branching loops tails off much as we expected,
although there is an interesting peak of degree 6 loops as well as smaller peaks
at 4, 8, and of course 12. We find this correlation between loop parity and
frequency interesting, although since ribosomal structure is highly conserved
across various organisms, the distributions of loop degrees for these 23S RNA
secondary structures are by no means independent.

As we do for the picornaviral sequences, discussed below, we investigate in
more detail the distribution of sizes among the internal loops. As we see from
Table 5, with only a few exceptions, the internal loops contain fewer than 16
unpaired bases, and a substantial fraction (48.36% on average) contain at most
2. It is reasonable [25] to consider two helices which are interrupted by an
internal loop of fewer than 3 bases as one contiguous stem. When we adjust the
count of loop degrees accordingly, by excluding internal/degree 1 loops with at
most 2 unpaired bases as in Tables 6 and 7, then we see a distribution with
different relative numbers of hairpin/degree 0, internal/degree 1, and
branching/degree $\geq 2$ loops. For these 23S ribosomal secondary structures,
our predicted branching distributions for dG 2.3 are closer to the original
unfiltered distribution, although the opposite will be true for the
picornaviral secondary structures.

The second set of results is found in Appendix A.2. We consider the 11
picornaviral sequences analyzed in [18], which are available online from the
Palmenberg Lab through http://www.virology.wisc.edu/acp/RNAFolds. The
predicted secondary structures were computed by the mfold program v2.2, using
the default values [18]. The average length for these sequences, as seen from
Table 8, is 7566.27 bases - considerably longer than the large subunit 23S
ribosomal sequences. We also list the external loop degrees separately in Table
since this special energy function is not considered in our asymptotic results. 

Again, the most prevalent loops have degree 1, as seen in Tables 9 and 10, and
the most common type of internal loops (48.14% on average) are those containing
at most 2 unpaired bases. However, the relative number of hairpin/degree 0,
internal/degree 1, and branching/degree ≥ 2 loops given in Tables 9 and 10
differs significantly from the LDP distribution. A large part of this deviation is
resolved after further investigation into the distribution of internal loop sizes.
As seen in Tables 11 and 12, there is a much broader distribution for the sizes
of internal loops. While most contain fewer than 16 unpaired bases, the number
of "large" internal loops does not drop off as sharply for the picornaviral
secondary structures as it did for the 23S ribosomal ones. When we filter the
data by excluding the smallest internal loops, as in Tables 13 and 14, then we
see a distribution that agrees even more closely with our LDP probabilities. In
this case, though, we have nearly equal numbers of hairpin/degree 0 loops and
internal/degree 1 loops, while the numbers of branching/degree ≥ 2 loops drop
off almost by a factor of 2. Thus, the predicted picornaviral configurations are
less extensively branched than the ribosomal secondary structures, and the
degree of branching more closely agrees with our LDP probabilities for the dG 2.3
model.
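
The bookkeeping behind this comparison can be sketched in a few lines of Python. The counts below are the "total" rows of Tables 9 and 13 (loop degrees 0 through 7); the function name and the grouping into three classes are illustrative choices, not part of the original analysis.

```python
# Totals over the 11 picornaviral structures, indexed by loop degree 0..7.
unfiltered = [1546, 3145, 481, 223, 88, 21, 9, 3]  # Table 9, "total" row
filtered = [1546, 1631, 481, 223, 88, 21, 9, 3]    # Table 13, "total" row

def summarize(counts):
    """Percentages of hairpin (degree 0), internal (degree 1),
    and branching (degree >= 2) loops, rounded to two decimals."""
    total = sum(counts)
    groups = (counts[0], counts[1], sum(counts[2:]))
    return tuple(round(100.0 * c / total, 2) for c in groups)

print("unfiltered:", summarize(unfiltered))
print("filtered:  ", summarize(filtered))
```

After filtering, the hairpin and internal percentages are close to each other and the branching percentage is roughly half of either, which is the pattern described above.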

7 Discussion of related results 

We adopt here a statistical mechanics approach, not to predict base pairs for a 
particular RNA sequence, but to analyze what a typical branching distribution 
might be for an arbitrary large RNA secondary structure. This work joins a 
growing body of results which analyze different general characteristics of RNA 
secondary structures, both theoretically [H [121 HS1 ttZ] and computationally [3l
El [H EH ESI [23]. The qualities investigated have been the free energy and
molecular stability [3l E3 ES [22l [23] as well as the number and type of different
substructural elements [H O E3 EE E]. Asymptotics of the expected maximum
number of base pairs are studied in [1], but the overall molecular configurations
are not addressed. 

Statistics for different structural elements are computed for short RNA
sequences < 100 bases in [9]. The unfiltered distribution of picornaviral degrees
agrees closely with their statistical reference probability densities, whereas the
distribution of the 23S ribosomal degrees resembles their "natural" sequence
distribution by having slightly more hairpin / degree 0 loops and fewer
internal / degree 1 loops. The statistics of average branching degree given in [9]
reflect the fact that for large RNA sequences the size N of the associated tree
is, typically, also large. Therefore, the average branching degree is close to 1 
due to the identity 

    \frac{1}{N} \sum_{k \ge 0} k \, n_k = \frac{N-1}{N},

where n_k denotes the number of vertices of degree k in a tree with N vertices.



This also agrees with the theoretical limit given in [12]; the asymptotic average
branching degree of 1 was derived for non-root vertices using a model of RNA
secondary structures at the base level and complicated recursion formulae
depending on n, the number of bases in the sequence. We have not yet investigated
the other characteristics analyzed in [9] and [12]; however, it may be possible to
extend our low-resolution model of RNA folding and this statistical mechanics 
approach to other properties of RNA secondary structures. 
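
The identity invoked above for the average branching degree is the edge count of a rooted tree: the numbers of children sum to N - 1, so the average is (N - 1)/N. A minimal numerical check is sketched below. It grows a tree by attaching each new vertex to a uniformly chosen earlier vertex, which is an assumption made for convenience and is neither the uniform plane tree nor the Gibbs model of this paper; the identity itself holds for any rooted tree.

```python
import random

def random_rooted_tree_degrees(n, seed=0):
    """Grow a rooted tree on n vertices by attaching each new vertex to a
    uniformly chosen earlier vertex; return each vertex's number of children."""
    rng = random.Random(seed)
    children = [0] * n
    for v in range(1, n):
        parent = rng.randrange(v)  # any earlier vertex 0..v-1
        children[parent] += 1
    return children

def average_degree(children):
    """Average number of children per vertex."""
    return sum(children) / len(children)

for n in (10, 100, 5000):
    deg = random_rooted_tree_degrees(n)
    # The two printed values agree: every non-root vertex is exactly one child.
    print(n, average_degree(deg), (n - 1) / n)
```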

In [16], the typical configuration of large subunit ribosomal RNA is investigated
using an approach based on generating functions and stochastic context-free
grammars. This approach yields explicit formulas for the frequency of different
structural elements as a function of the sequence length n. Using the average
sequence lengths for the 23S ribosomal and picornaviral secondary structures
as n_1 and n_2, we computed the predicted number of hairpin, internal, and
branching loops as well as the average degree of a branching loop. As in [16], we
compare the averages from the RNA secondary structures and the predicted
frequencies, and find reasonably good agreement for the 23S ribosomal structures.
The relative differences for the predicted frequencies from the unfiltered 23S
ribosomal averages are: -3.28% for hairpin loops, -13.05% for internal loops,
1.09% for branching loops, -6.35% for the total number of loops, and 1.84% for
the average branching degree. In contrast, the comparisons for the picornaviral
secondary structures are not as good. The relative differences for the predicted
frequencies from the unfiltered picornaviral averages are: 21.67% for hairpin
loops, -31.46% for internal loops, 6.94% for branching loops, -10.52% for the
total number of loops, and 19.42% for the average branching degree. Since the
equations in [16] were derived by training the grammar on a database of large
subunit ribosomal RNA, it is perhaps not surprising that the predictions of the
model do not correspond as well to the picornaviral secondary structures. The
paper [17] provides related results by considering a model of RNA folding where
two bases pair with probability p and investigates different properties of the
RNA secondary structures, but does not include an analysis of branching
degrees.

8 Conclusions 

We considered Gibbs distributions for our plane tree model of RNA folding 
based on the nearest neighbor thermodynamics. An important feature of our 
model is that we can describe the typical branching configurations of the trees 
by calculating the asymptotic degree sequences via a Large Deviation Principle 
(LDP). As discussed, this has at least two implications for the branching of large 
RNA secondary structures, such as the large subunit 23S ribosomal molecules 
or RNA viral genomes like picornaviruses. 

One implication concerns the asymptotic distribution of vertex degrees in 
a large random tree from our model. The LDP for our model implies that, 
typically, the frequency of the loops of degree k decreases exponentially in k. 
The exact rate of decay depends on the specific thermodynamic parameters, 
however, and we considered two sets of energy values, the current standard 
dG 3.0 and the former standard dG 2.3. Surprisingly, we find that the typi- 
cal distribution based on the dG 2.3 parameters corresponds more closely to 
the branching degrees of both the picornaviral and ribosomal RNA secondary 
structures. The differences in the thermodynamic values are primarily a result 
of changes in the loop destabilizing energies for the hairpin and internal loops 
as well as more significant changes in the offset, free base penalty, and helix 
penalty for the multibranched loop energy function. These changes then have 
a significant impact on the distribution among loops of small degrees as well 
as on the decay rate for the tail of the distribution. To be able to distinguish 
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unusual substructures against the background of a typical configuration, we will 
need to understand better the impact of different thermodynamic values on the 
behavior of the model. 
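
The exponential decay in k can be illustrated with a small numerical sketch. It assumes that the typical degree distribution has the tilted Gibbs form p_k proportional to exp(-E(k)/RT) t^k, with t chosen so that the mean degree equals 1; this is the standard form for trees weighted by a product of vertex weights and is not a reproduction of the rate function derived in this paper. The energies in `energy` are hypothetical placeholders, not the dG 2.3 or dG 3.0 parameters.

```python
import math

RT = 0.616  # kcal/mol near 37 C (gas constant times temperature, rounded)

def energy(k):
    """Hypothetical loop energies in kcal/mol (placeholders only)."""
    if k == 0:
        return 4.0          # hairpin loop
    if k == 1:
        return 3.0          # internal loop
    return 3.0 + 0.5 * k    # branching loop: offset plus a per-branch penalty

def tilted(s, kmax):
    """Distribution p_k proportional to exp(-energy(k)/RT + s*k), k = 0..kmax."""
    logw = [-energy(k) / RT + s * k for k in range(kmax + 1)]
    m = max(logw)  # subtract the maximum for numerical stability
    w = [math.exp(x - m) for x in logw]
    z = sum(w)
    return [x / z for x in w]

def mean(p):
    return sum(k * pk for k, pk in enumerate(p))

def typical_distribution(kmax=60, iterations=200):
    """Bisect on the tilt s = log t until the mean degree equals 1."""
    lo, hi = -50.0, 50.0  # the mean of tilted(s) is increasing in s
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean(tilted(mid, kmax)) < 1.0:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi), kmax)

p = typical_distribution()
ratios = [p[k + 1] / p[k] for k in range(2, 8)]
print([round(x, 4) for x in p[:6]])
print([round(r, 4) for r in ratios])  # constant ratio: geometric decay for k >= 2
```

Because the placeholder branching energy is linear in k, the ratio p_{k+1}/p_k is constant for k ≥ 2. Changing the offset or the per-branch penalty changes both the mass on small degrees and the decay rate of the tail.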

A second implication to emerge from our current analysis is that combina- 
torial constraints lead to important entropy considerations in determining the 
most likely branching distributions in large random trees. The nontrivial
combinatorics of the plane trees implies that typical trees are minimizers of the free
energy corrected by an extra entropy term. Thus, although the typical trees 
in the dG 3.0 model are structurally, and therefore energetically, related to the 
trees which have minimal energy, a typical large tree will not be a minimizer 
of the free energy understood as the sum of the energies of individual loops. 
In fact, the LDP tells us that, in our combinatorial model of RNA folding, the 
energy-minimizing trees are extremely improbable. Thus, when modeling the 
folding of large RNA molecules, it is important to include entropy considera- 
tions which distinguish the most likely configurations from those which simply 
minimize the additive free energy. 
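
The size of the entropy contribution can be made concrete with the classical enumeration of plane trees by degree sequence: if n_k vertices have k children, with the n_k summing to N and the k n_k summing to N - 1, the number of plane trees is (1/N) N! / (n_0! n_1! ...). The sketch below uses this standard formula; the function name and the example degree counts are illustrative assumptions, not data from the paper.

```python
from math import factorial

def count_plane_trees(n):
    """Number of plane trees with n[k] vertices having exactly k children.

    Returns 0 when the counts cannot come from a tree, that is, when the
    total number of children differs from (number of vertices) - 1.
    """
    N = sum(n)
    edges = sum(k * nk for k, nk in enumerate(n))
    if N == 0 or edges != N - 1:
        return 0
    count = factorial(N)
    for nk in n:
        count //= factorial(nk)  # exact: partial multinomials are integers
    return count // N

print(count_plane_trees([4, 0, 3]))         # 5 full binary trees on 7 vertices
print(count_plane_trees([1, 99]))           # 1: a single path on 100 vertices
print(count_plane_trees([31, 49, 10, 10]))  # a mixed profile on 100 vertices
```

A degree profile concentrated on one loop type is realized by very few trees, while a mixed profile on the same number of vertices is realized by an astronomically large number, which is why the most probable profile need not minimize the additive energy.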

A Analysis of RNA Branching Degrees

A.1 23S Ribosomal RNA



Index  Type  Organism Name              GenBank Accession #       Length  Degree
1      a     Haloarcula marismortui     X13738                    2925    1
2      a     Thermococcus celer         M67497                    3029    1
3      b     Thermotoga maritima        M67498                    3023    1
4      b     Thermus thermophilus       X12612                    2915    1
5      b     Borrelia burgdorferi       M88330                    2926    1
6      b     Escherichia coli           J01695                    2904    1
7      b     Pseudomonas aeruginosa     Y00432                    2893    1
8      b     Bacillus subtilis          K00637 AF008220 Z99119    2927    1
9      b     Mycobacterium leprae       X56657                    3122    1
10     c     Chlamydomonas reinhardtii  X15727                    2902    1
11     c     Zea mays                   Z00028                    2985    1
12     m     Chlamydomonas eugametos    AF008237                  1915    13
13     m     Saccharomyces cerevisiae   J01527                    3273    1
14     m     Zea mays                   K01868                    3514    6
15     m     Caenorhabditis elegans     X54252                    953     8
16     m     Drosophila melanogaster    X53506                    1335    9
17     m     Xenopus laevis             M10217                    1640    12
18     e     Giardia intestinalis       X52949                    2850    10
19     e     Saccharomyces cerevisiae   U53879                    3554    7
20     e     Arabidopsis thaliana       X52320                    3539    12

Table 2: Sequence information, including the degree of the external loop, for
20 of the 77 pseudoknot-free 23S ribosomal RNA secondary structures from
the CRW [2]. The 20 selected were also used in the analyses of [8], and are
representative of the whole set. The different types of sequences are (a) Archaea,
(b) Eubacteria, (c) Chloroplast, (m) Mitochondria, and (e) Eucarya.



Index   sum     0     1     2     3     4     5     6     7     8     9    10    11    12
1       197    70    93    14     9     9     -     2     -     -     -     -     -     -
2       191    73    88    13     7     6     1     1     -     1     -     -     -     1
3       201    72    94    15     9     9     -     1     1     -     -     -     -     -
4       192    72    91    12     7     6     1     1     -     1     -     -     -     1
5       201    71    95    15     9     9     -     2     -     -     -     -     -     -
6       199    70    95    14    10     8     -     1     1     -     -     -     -     -
7       199    70    95    14    10     8     -     1     1     -     -     -     -     -
8       207    71   102    14     9     9     -     1     1     -     -     -     -     -
9       205    74   100    14     7     6     1     1     -     1     -     -     -     1
10      202    70    98    14     9     9     -     2     -     -     -     -     -     -
11      206    71   100    15     9     9     -     2     -     -     -     -     -     -
12      115    49    50     6     3     5     1     1     -     -     -     -     -     -
13      157    59    73    12     6     3     1     2     -     -     -     -     -     1
14      180    65    85    15     6     6     1     2     -     -     -     -     -     -
15       49    23    18     4     1     3     -     -     -     -     -     -     -     -
16       77    33    33     4     1     6     -     -     -     -     -     -     -     -
17      101    41    46     6     3     3     2     -     -     -     -     -     -     -
18      190    74    82    17     7     8     1     1     -     -     -     -     -     -
19      221    80   102    21     8     8     -     1     -     1     -     -     -     -
20      218    80   102    20     7     6     1     1     -     1     -     -     -     -
total  3508  1288  1642   259   137   136    10    23     4     5     -     -     -     4

Table 3: Degree distributions of loops.

Index       0      1      2      3      4      5      6      7      8      9     10     11     12
1        35.5   47.2    7.1    4.6    4.6      -    1.0      -      -      -      -      -      -
2        38.2   46.1    6.8    3.7    3.1    0.5    0.5      -    0.5      -      -      -    0.5
3        35.8   46.8    7.5    4.5    4.5      -    0.5    0.5      -      -      -      -      -
4        37.5   47.4    6.2    3.6    3.1    0.5    0.5      -    0.5      -      -      -    0.5
5        35.3   47.3    7.5    4.5    4.5      -    1.0      -      -      -      -      -      -
6        35.2   47.7    7.0    5.0    4.0      -    0.5    0.5      -      -      -      -      -
7        35.2   47.7    7.0    5.0    4.0      -    0.5    0.5      -      -      -      -      -
8        34.3   49.3    6.8    4.3    4.3      -    0.5    0.5      -      -      -      -      -
9        36.1   48.8    6.8    3.4    2.9    0.5    0.5      -    0.5      -      -      -    0.5
10       34.7   48.5    6.9    4.5    4.5      -    1.0      -      -      -      -      -      -
11       34.5   48.5    7.3    4.4    4.4      -    1.0      -      -      -      -      -      -
12       42.6   43.5    5.2    2.6    4.3    0.9    0.9      -      -      -      -      -      -
13       37.6   46.5    7.6    3.8    1.9    0.6    1.3      -      -      -      -      -    0.6
14       36.1   47.2    8.3    3.3    3.3    0.6    1.1      -      -      -      -      -      -
15       46.9   36.7    8.2    2.0    6.1      -      -      -      -      -      -      -      -
16       42.9   42.9    5.2    1.3    7.8      -      -      -      -      -      -      -      -
17       40.6   45.5    5.9    3.0    3.0    2.0      -      -      -      -      -      -      -
18       38.9   43.2    8.9    3.7    4.2    0.5    0.5      -      -      -      -      -      -
19       36.2   46.2    9.5    3.6    3.6      -    0.5      -    0.5      -      -      -      -
20       36.7   46.8    9.2    3.2    2.8    0.5    0.5      -    0.5      -      -      -      -
total   36.72  46.81   7.38   3.91   3.88   0.29   0.66   0.11   0.14      -      -      -   0.11

Table 4: Degree distributions of loops as percentages.











Number of internal loops 


with 


1 < 


size 


< 15 






List of large 


Index 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 13 


14 


15 


loop sizes 


1 


27 


13 


6 


7 


4 


7 


4 


6 


10 


4 


1 


1 1 


1 




37 


2 


26 


15 


2 


8 


5 


6 


5 


6 


9 


2 


1 


1 


1 




35 


3 


24 


17 


4 


10 


4 


10 


7 


4 


6 


4 


2 




1 




31 


4 


27 


17 


4 


11 


6 


8 


4 


3 


5 


2 


2 




1 




31 


5 


26 


18 


4 


10 


7 


8 


5 


4 


7 


2 


1 


1 


1 




30 


6 


24 


19 


3 


11 


8 


7 


3 


5 


9 


2 


2 




1 




30 


7 


26 


17 


5 


11 


6 


8 


5 


4 


7 


2 


2 




1 




31 


8 


32 


18 


4 


10 


7 


9 


5 


4 


6 


2 


2 




1 


1 


30 


9 


29 


19 


3 


10 


8 


9 


6 


3 


5 


3 


2 




1 


1 


31 


10 


29 


19 


3 


9 


6 


10 


4 


4 


7 


2 


2 




1 




29, 41 


11 


28 


18 


6 


10 


7 


6 


7 


5 


7 


1 


2 




1 




20, 29 


12 


15 


9 


4 


4 


1 


4 


2 


1 


2 


3 


1 






1 


18, 20, 27 


13 


27 


12 


4 


6 


3 


5 




4 


4 




2 


1 


1 




28, 36, 50, 64 


14 


26 


16 


4 


9 


3 


5 


5 


4 


7 


2 


1 


1 


1 




31 


15 


4 


8 


2 


1 


1 








2 














16 


10 


9 


3 


3 




1 




1 


4 












17, 20 


17 


15 


9 


5 


4 


2 


3 




1 


2 


1 




1 






18, 33, 47 


18 


27 


14 


4 


9 


2 


5 


5 


5 


3 


3 


1 


1 


1 




25, 27 


19 


34 


19 


3 


11 


5 


7 


7 


4 


5 


3 


1 


1 


1 




36 


20 


35 


17 


4 


9 


5 


7 


7 


3 


5 


3 


1 


1 


3 




16, 37 



Table 5: Number of internal loops of different sizes, given as the distribution of 
loops with at most 15 unpaired bases and as a list of large internal loop sizes 
with multiplicity.

Index   sum     0     1     2     3     4     5     6     7     8     9    10    11    12
1       157    70    53    14     9     9     -     2     -     -     -     -     -     -
2       150    73    47    13     7     6     1     1     -     1     -     -     -     1
3       160    72    53    15     9     9     -     1     1     -     -     -     -     -
4       148    72    47    12     7     6     1     1     -     1     -     -     -     1
5       157    71    51    15     9     9     -     2     -     -     -     -     -     -
6       156    70    52    14    10     8     -     1     1     -     -     -     -     -
7       156    70    52    14    10     8     -     1     1     -     -     -     -     -
8       157    71    52    14     9     9     -     1     1     -     -     -     -     -
9       157    74    52    14     7     6     1     1     -     1     -     -     -     1
10      154    70    50    14     9     9     -     2     -     -     -     -     -     -
11      160    71    54    15     9     9     -     2     -     -     -     -     -     -
12       91    49    26     6     3     5     1     1     -     -     -     -     -     -
13      118    59    34    12     6     3     1     2     -     -     -     -     -     1
14      138    65    43    15     6     6     1     2     -     -     -     -     -     -
15       37    23     6     4     1     3     -     -     -     -     -     -     -     -
16       58    33    14     4     1     6     -     -     -     -     -     -     -     -
17       77    41    22     6     3     3     2     -     -     -     -     -     -     -
18      149    74    41    17     7     8     1     1     -     -     -     -     -     -
19      168    80    49    21     8     8     -     1     -     1     -     -     -     -
20      166    80    50    20     7     6     1     1     -     1     -     -     -     -
total  2714  1288   848   259   137   136    10    23     4     5     -     -     -     4

Table 6: Degree distributions of loops with contiguous stems.



Index       0      1      2      3      4      5      6      7      8      9     10     11     12
1        39.3   29.8    7.9    5.1    5.1      -    1.1      -      -      -      -      -      -
2        41.2   26.6    7.3    4.0    3.4    0.6    0.6      -    0.6      -      -      -    0.6
3        40.7   29.9    8.5    5.1    5.1      -    0.6    0.6      -      -      -      -      -
4        41.4   27.0    6.9    4.0    3.4    0.6    0.6      -    0.6      -      -      -    0.6
5        40.8   29.3    8.6    5.2    5.2      -    1.1      -      -      -      -      -      -
6        40.0   29.7    8.0    5.7    4.6      -    0.6    0.6      -      -      -      -      -
7        40.0   29.7    8.0    5.7    4.6      -    0.6    0.6      -      -      -      -      -
8        42.3   31.0    8.3    5.4    5.4      -    0.6    0.6      -      -      -      -      -
9        43.5   30.6    8.2    4.1    3.5    0.6    0.6      -    0.6      -      -      -    0.6
10       41.2   29.4    8.2    5.3    5.3      -    1.2      -      -      -      -      -      -
11       41.3   31.4    8.7    5.2    5.2      -    1.2      -      -      -      -      -      -
12       25.3   13.4    3.1    1.5    2.6    0.5    0.5      -      -      -      -      -      -
13       33.0   19.0    6.7    3.4    1.7    0.6    1.1      -      -      -      -      -    0.6
14       36.9   24.4    8.5    3.4    3.4    0.6    1.1      -      -      -      -      -      -
15       11.2    2.9    1.9    0.5    1.5      -      -      -      -      -      -      -      -
16       16.6    7.0    2.0    0.5    3.0      -      -      -      -      -      -      -      -
17       21.1   11.3    3.1    1.5    1.5    1.0      -      -      -      -      -      -      -
18       41.8   23.2    9.6    4.0    4.5    0.6    0.6      -      -      -      -      -      -
19       48.5   29.7   12.7    4.8    4.8      -    0.6      -    0.6      -      -      -      -
20       48.2   30.1   12.0    4.2    3.6    0.6    0.6      -    0.6      -      -      -      -
total   47.46  31.25   9.54   5.05   5.01   0.37   0.85   0.15   0.18      -      -      -   0.15

Table 7: Degree distributions of loops as percentages with contiguous stems.

A.2 Picornaviral RNA



Index  Virus Name                                     GenBank Acc. #  Length  Degree
1      coxsackievirus B3                              M33854          7396    13
2      ECHO virus-22                                  L02971          7339    17
3      encephalomyocarditis virus-A                   M81861          7735    25
4      foot-and-mouth disease virus-A12               M10975          8214    1
5      hepatitis A virus-Hm175                        M14707          7478    20
6      rhinovirus-14                                  K02121          7212    16
7      rhinovirus-16                                  L24917          7124    11
8      Mengovirus-M                                   L22089          7761    26
9      poliovirus 1-Mahoney                           J0228140        7440    20
10     poliovirus 3-Sabin                             X00596          7432    22
11     Theiler's murine encephalomyelitis virus-Bean  M16020          8098    37

Table 8: Sequence information, including the degree of the external loop,
for the 11 picornaviral sequences analyzed in [18], available online through
http://www.virology.wisc.edu/acp/RNAFolds.



Index   sum     0     1     2     3     4     5     6     7
1       499   142   281    41    20    12     3     -     -
2       478   130   275    42    23     7     1     -     -
3       530   138   320    44    16    11     1     -     -
4       594   167   337    45    25    14     3     1     2
5       482   136   267    58    12     4     3     2     -
6       454   126   259    37    25     5     2     -     -
7       456   140   237    46    23     6     1     3     -
8       494   130   300    37    17     8     1     1     -
9       485   143   267    43    21     8     2     -     1
10      507   137   293    50    20     4     2     1     -
11      537   157   309    38    21     9     2     1     -
total  5516  1546  3145   481   223    88    21     9     3

Table 9: Degree distributions of loops.



Index       0      1      2      3      4      5      6      7
1        26.4   52.3    7.6    3.7    2.2    0.6      -      -
2        24.2   51.2    7.8    4.3    1.3    0.2      -      -
3        25.7   59.6    8.2    3.0    2.0    0.2      -      -
4        31.1   62.8    8.4    4.7    2.6    0.6    0.2    0.4
5        25.3   49.7   10.8    2.2    0.7    0.6    0.4      -
6        23.5   48.2    6.9    4.7    0.9    0.4      -      -
7        26.1   44.1    8.6    4.3    1.1    0.2    0.6      -
8        24.2   55.9    6.9    3.2    1.5    0.2    0.2      -
9        26.6   49.7    8.0    3.9    1.5    0.4      -    0.2
10       25.5   54.6    9.3    3.7    0.7    0.4    0.2      -
11       29.2   57.5    7.1    3.9    1.7    0.4    0.2      -
total   28.03  57.02   8.72   4.04   1.60   0.38   0.16   0.05

Table 10: Degree distributions of loops as percentages.

Index    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
1       68   65   45   29    9    9    9    3    6    6    8    4    3    7    2
2       68   60   33   26   15   11    8    5    9   11    3    1    4    4    1
3       83   66   52   32   20    8    9    5    9    5    5    5    4    7    2
4       99   90   48   26   16   13    9    7    6    6    2    3    3    3    2
5       51   90   34   23   10   12    6    5    7    7    8    3    2    4    1
6       67   59   33   19   12   15    4    3    6    3    4    5    6    1    6
7       49   66   21   30   12    8    7    3    5    2    3    2    2    2    4
8       63   68   38   25   29   14    6   10   10   10    3    2    4    1    5
9       55   69   36   20   14    7   12    5   12    4    2    4    3    2    4
10      68   65   34   35   15    9   10    9   13    8    4    5    3    4    1
11      79   66   42   32   20    8   10    9    6    6    5    4    4    2    1

Table 11: Distribution of internal loops with at most 15 unpaired bases.



Index 


16 


17 


18 


19 


20 


21 


22 


23 


24 


25 


26 


27 


28 


29 


30 


1 






1 






4 


3 


















2 


2 


2 


2 


3 


1 


1 


1 


1 


1 


1 




1 








3 


3 


1 


1 




1 






1 












1 




4 


3 
















1 














5 








1 






2 












1 






6 




2 


1 


3 


3 


1 


2 


1 






1 






1 


1 


7 


3 


1 


5 


2 


2 




1 


3 


1 




1 


2 








8 


1 


4 


1 


1 


1 






2 


1 












1 


9 


5 


3 


5 


2 


1 




1 


1 
















10 


3 


2 


1 




1 








1 








1 




1 


11 


2 


3 




2 


3 




1 






1 








2 


1 



Table 12: Distribution of large ( > 15 unpaired bases) internal loop sizes. 



Index   sum     0     1     2     3     4     5     6     7
1       366   142   148    41    20    12     3     -     -
2       350   130   147    42    23     7     1     -     -
3       381   138   171    44    16    11     1     -     -
4       405   167   148    45    25    14     3     1     2
5       341   136   126    58    12     4     3     2     -
6       328   126   133    37    25     5     2     -     -
7       341   140   122    46    23     6     1     3     -
8       363   130   169    37    17     8     1     1     -
9       361   143   143    43    21     8     2     -     1
10      374   137   160    50    20     4     2     1     -
11      392   157   164    38    21     9     2     1     -
total  4002  1546  1631   481   223    88    21     9     3

Table 13: Degree distributions of loops with contiguous stems.



Index       0      1      2      3      4      5      6      7
1        35.1   36.6   10.1    5.0    3.0    0.7      -      -
2        31.8   35.9   10.3    5.6    1.7    0.2      -      -
3        35.6   44.1   11.3    4.1    2.8    0.3      -      -
4        48.0   42.5   12.9    7.2    4.0    0.9    0.3    0.6
5        34.3   31.8   14.6    3.0    1.0    0.8    0.5      -
6        30.7   32.4    9.0    6.1    1.2    0.5      -      -
7        33.2   28.9   10.9    5.5    1.4    0.2    0.7      -
8        32.0   41.6    9.1    4.2    2.0    0.2    0.2      -
9        34.6   34.6   10.4    5.1    1.9    0.5      -    0.2
10       33.9   39.6   12.4    5.0    1.0    0.5    0.2      -
11       40.1   41.8    9.7    5.4    2.3    0.5    0.3      -
total   38.63  40.75  12.02   5.57   2.20   0.52   0.22   0.07

Table 14: Degree distributions of loops as percentages with contiguous stems.